Poster: mark_k Date: Oct 7, 2003 12:16am
Forum: web Subject: Broken/truncated .zip files in Wayback Machine

Hi,

Here's a problem that I've noticed when using the Wayback Machine. Basically, files which end with a zero byte are truncated: that final byte is missing.

Zip archive files normally end with a zero byte. If that last zero byte is not present, many or most archiver programs think the file is corrupted/bad.

I have downloaded several .zip archives from the Wayback Machine, and all are truncated by that one byte. This indicates that the web-crawling software used by the Internet Archive is (or was) defective.

If you have downloaded such a file, you can fix it by appending a zero byte to the end of the .zip file.
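
For anyone unsure how to do that, here is a minimal sketch in Python. The function name `append_zero_byte` and the file name in the usage note are only illustrative, and it is worth running this on a copy, since it modifies the file in place.

```python
def append_zero_byte(path):
    """Append a single 0x00 byte to the end of the file at `path`."""
    # "ab" = append + binary: the existing bytes are left untouched and
    # no newline translation happens on any platform.
    with open(path, "ab") as f:
        f.write(b"\x00")
```

For example, `append_zero_byte("copy-of-download.zip")`, then test the result with your zip program.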


Regards,
-- Mark

Reply

Poster: Brad Date: Dec 24, 2003 7:29am
Forum: web Subject: Re: Broken/truncated .zip files in Wayback Machine

Hi Mark,

It's definitely possible that the various crawlers which have accumulated the web archives over the years have had faults.

Another common cause of this kind of problem, which may be what's occurring here, is that many of the crawlers stopped downloading files at 1MB, so that's all we have in the collection. Quite a calamity, but that's where we are.

If a zip file ends at exactly 1MB (or within a few bytes of it; for the webnerds, the 1MB sometimes includes the HTTP header), then this may be the problem. Generally, there are internal consistency checks (CRCs) in the zip files themselves, so if your solution of adding a null byte to the end of the file gets you a usable zip file, then that's great! Please let me know if this is the case by responding here, as we may be able to auto-detect and fix this problem as we serve the files, in the future.
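
Both checks described above can be sketched with Python's standard `zipfile` module. The helper names `is_usable_zip` and `near_size_cap` are hypothetical, and both the 1MB value (taken here as 1,048,576 bytes) and the slack allowance are assumptions, since the exact crawler limit isn't stated.

```python
import os
import zipfile

ONE_MB = 1024 * 1024  # assumption: "1MB" means 2**20 bytes


def is_usable_zip(path):
    """Return True if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() returns the name of the first bad member, or None.
            return zf.testzip() is None
    except zipfile.BadZipFile:
        # Raised when the end-of-central-directory record cannot be read.
        return False


def near_size_cap(path, cap=ONE_MB, slack=1024):
    """Heuristic: is the file size within `slack` bytes of the assumed cap?"""
    return abs(os.path.getsize(path) - cap) <= slack
```

If `is_usable_zip` returns False before the null byte is added and True afterwards, and `near_size_cap` is False, the 1MB cutoff is probably not the cause.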

Thanks for posting your solution.

Brad

Reply

Poster: Z19 Date: Dec 31, 2005 2:18am
Forum: web Subject: Re: Broken/truncated .zip files in Wayback Machine

On December 24, 2003 03:29:40pm Brad wrote:
> we may be able to auto-detect and fix this problem as we serve the files, in the future

Two years later, this has been reported numerous times, and you are still serving zip files one byte short. Is this ever going to be fixed?

Reply

Poster: wjgeorge Date: Jun 28, 2007 9:40pm
Forum: web Subject: Re: Broken/truncated .zip files in Wayback Machine

1) Copy the file you want to fix to a file named "fixme"
2) Run the fixme.pl script:

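use strict; use warnings;
# Append one zero byte to the file named "fixme"
open(my $fh, '>>', 'fixme') or die "Cannot open fixme: $!";
binmode($fh);
defined(syswrite($fh, chr(0), 1)) or die "Write failed: $!";
close($fh) or die "Cannot close fixme: $!";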

3) Test with your zip program
4) Rename or copy the file named "fixme" to your original file name




This post was modified by wjgeorge on 2007-06-29 04:40:01

Reply

Poster: Z19 Date: Jun 28, 2007 11:25pm
Forum: web Subject: Re: Broken/truncated .zip files in Wayback Machine

Well, thanks, but I can fix the files myself.
The problem is, why doesn't the Archive fix their files?

Why doesn't anyone at the Archive even bother to reply?

It's only been an outstanding, known problem with a simple fix for at least 4 years.

Reply

Poster: mark_k Date: Dec 24, 2003 8:39am
Forum: web Subject: Re: Broken/truncated .zip files in Wayback Machine

Hi,

The problem is definitely not related to the 1MB file size issue; the files I examined were 200K or so long, from memory.

As I understand it, in the .zip file format there is a list of filenames in the archive right at the end of the file. And presumably the file format requires a trailing zero byte.

Some/most archiver programs report an error like "end of central directory signature not found" (from memory) when you try to work with a file missing its final zero byte.
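
For background, a zip file ends with an "end of central directory" (EOCD) record whose last field is a two-byte comment length. When the archive has no comment those two bytes are zero, which would explain both the trailing zero byte and that error message. A rough sketch of how the one-byte truncation could be detected follows; `eocd_shortfall` is a hypothetical helper, and it assumes the archive carries no comment and that the last occurrence of the signature really is the EOCD record.

```python
from typing import Optional

EOCD_SIGNATURE = b"PK\x05\x06"  # end-of-central-directory signature
EOCD_MIN_SIZE = 22              # record size when the archive has no comment


def eocd_shortfall(data: bytes) -> Optional[int]:
    """Return how many bytes of a comment-less EOCD record are missing.

    None means no signature was found; 0 means the minimal record is complete.
    """
    pos = data.rfind(EOCD_SIGNATURE)
    if pos == -1:
        return None
    present = len(data) - pos  # bytes available from the signature onward
    return max(EOCD_MIN_SIZE - present, 0)
```

A result of 1 matches the symptom described in this thread: the record is present but its final byte is gone.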

If you like, I can send some example web.archive.org URLs by private email so you can investigate this issue yourself.

Regards,
-- Mark

Reply

Poster: Dolphin Date: Dec 23, 2003 10:19am
Forum: web Subject: Re: Broken/truncated .zip files in Wayback Machine

That's great news. It's hard enough as it is to get a zip file from the Wayback Machine because you almost always get the "cannot connect" error. Now how does one append a zero byte to a zip file?